CS 501 HW 2

Author

Hayden Coke and Hodan Aden

Published

September 23, 2025

Plotting Practice

We will use the classic diamonds dataset in today’s exercise (available in data packages for both R and Python, and on Kaggle). Each variable is described below.

  • price: price in US dollars ($326–$18,823)
  • carat: weight of the diamond (0.2–5.01)
  • cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
  • color: diamond colour, from J (worst) to D (best)
  • clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
  • x: length in mm (0–10.74)
  • y: width in mm (0–58.9)
  • z: depth in mm (0–31.8)
  • depth: total depth percentage = 100 * z / mean(x, y) = 200 * z / (x + y) (43–79)
  • table: width of top of diamond relative to widest point (43–95)
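The depth definition can be checked by hand against the first five rows of the table printed in the Answers section. The sketch below assumes depth is expressed as a percentage (hence the factor of 100, consistent with the 43–79 range); the helper name depth_percent is made up for illustration.

```python
def depth_percent(x, y, z):
    """Total depth as a percentage: z divided by the mean of x and y, times 100."""
    if x + y == 0:
        # x and y can be 0 in this dataset (see the ranges above), so guard the division
        return float("nan")
    return 100 * 2 * z / (x + y)

# (x, y, z, recorded depth) copied from the first five rows of the printed table
rows = [
    (3.95, 3.98, 2.43, 61.5),
    (3.89, 3.84, 2.31, 59.8),
    (4.05, 4.07, 2.31, 56.9),
    (4.20, 4.23, 2.63, 62.4),
    (4.34, 4.35, 2.75, 63.3),
]

for x, y, z, recorded in rows:
    computed = depth_percent(x, y, z)
    print(f"recorded {recorded:.1f}  computed {computed:.1f}  difference {computed - recorded:+.2f}")
```

For these rows the computed values land within a few tenths of the recorded depth. The small gaps may come from rounding in x, y and z, or from how depth was originally measured.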

Instructions

  1. Use describe() and sns.pairplot() to examine overall trends in the data. Describe what you observe using markdown text.

  2. Which variables seem to influence each other? Make an informed guess about the direction of causality between these variables (you can refer to the Wikipedia page on diamonds if you find it helpful).

  3. Choose any set of appropriate variables to generate each of the following plots. You can refer to the seaborn gallery for examples.

  • Scatter plot with a single pair of variables
  • Scatter plot of two variables with a third variable encoded using color
  • Bar plot with two variables
  • Grouped bar plot of two variables with a third variable encoded using color
  • Facet grid showing multiple levels of an ordinal variable on one axis
  • Violin plot showing a single continuous numeric variable on the y axis and a categorical variable on the x axis

Answers

Your text and code go here! Use markdown to nicely format your plots and answers.

# import libraries
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# import dataset
df = pd.read_csv('./../../datasets/diamonds.csv')
df
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64

53940 rows × 10 columns

Generating plots

Scatterplot of length (x) and carat

g1 = sns.scatterplot(df, x='x', y='carat')
plt.title("Length vs Carat")
plt.show()

Scatter plot of two variables with a third variable encoded using color

g2 = sns.scatterplot(df, x='carat', y='price', hue='color')
plt.show()

Bar plot of cut and price

# Order the bars from worst to best cut (Premium ranks below Ideal)
g3 = sns.barplot(df, x='cut', y='price',
                order=['Fair','Good','Very Good','Premium','Ideal'])
plt.title("Cut vs Price")
plt.show()

Grouped bar plot

g4 = sns.catplot(df, kind="bar",
    x="clarity", y="price", hue="cut")
plt.show()

Facet grid

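# Order the facets from worst to best clarity and wrap them onto two rows
g5 = sns.FacetGrid(df, col='clarity', col_wrap=4,
                   col_order=['I1','SI2','SI1','VS2','VS1','VVS2','VVS1','IF'])
g5.map(sns.scatterplot, 'x', 'carat')
plt.show()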

Violin plot of price (continuous numeric) and clarity (categorical ordinal)

# List all eight clarity levels from best to worst; omitting 'IF' would drop those diamonds
g6 = sns.violinplot(df, x='clarity', y='price',
                    order=['IF','VVS1','VVS2','VS1','VS2','SI1','SI2','I1'])
plt.show()
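The order= lists in the plots above are typed by hand, which makes it easy to misplace or omit a level. One alternative, sketched here as an assumption about how the data could be prepared, is to store cut and clarity as ordered categoricals once; seaborn then follows the category order when no order= is given. The names CUT_ORDER, CLARITY_ORDER and as_ordered are hypothetical, and the sample rows are the first five rows of the printed table.

```python
import pandas as pd

# Worst-to-best orderings taken from the variable descriptions at the top
CUT_ORDER = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
CLARITY_ORDER = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

def as_ordered(frame):
    """Return a copy with cut and clarity stored as ordered categoricals."""
    out = frame.copy()
    out["cut"] = pd.Categorical(out["cut"], categories=CUT_ORDER, ordered=True)
    out["clarity"] = pd.Categorical(out["clarity"], categories=CLARITY_ORDER, ordered=True)
    return out

# First five rows of the printed table
sample = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good"],
    "clarity": ["SI2", "SI1", "VS1", "VS2", "SI2"],
    "price": [326, 326, 327, 334, 335],
})

ordered = as_ordered(sample)
print(ordered["cut"].cat.categories)
print(ordered["clarity"].max())  # best clarity present in the sample
```

With the full dataset this would be as_ordered(df), after which the bar, facet and violin plots can be drawn without repeating the level lists.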

Extra Credit

  • Choose two plots from the seaborn gallery that we haven’t already used and which are relevant to your EDA. Re-create them here using the diamonds dataset.
  • Explain why these are appropriate plots for this type of data. What trends or insights are visible?

Stacked histogram

Based on this seaborn gallery example

sns.set_theme(style="ticks")  # set the theme before creating the figure so it takes effect
f, ax = plt.subplots(figsize=(7, 5))
sns.despine(f)
sns.histplot(
    df,
    x="price", hue="cut",
    multiple="stack",
    edgecolor=".3",
    linewidth=.5,
    # log_scale=True,
)
ax.xaxis.set_major_formatter(mpl.ticker.ScalarFormatter())
ax.set_xticks([500, 1000, 2000, 5000, 10000])
ax.set_xticklabels(ax.get_xticklabels(), rotation=-90)
plt.show()

The gallery example also uses the diamonds dataset, and I wanted to see what it would look like without log scaling. Log scaling makes the plot more readable, but the linear scale is more intuitive to me.

Based on this plot, most prices in this dataset appear to fall in the 500–1,500 range. This may mean that most diamonds sell in this range, but confirming that would take more analysis. This range also appears to be where the majority of the “Ideal” cut diamonds are.
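One way to start that further analysis is to compute the share of rows whose price falls in a given range, overall and per cut. The sketch below is only an illustration: the helper share_in_range is hypothetical, and the ten rows are copied from the head and tail of the printed table, so the numbers it prints say nothing about the full dataset.

```python
import pandas as pd

# Ten rows copied from the head and tail of the table printed above.
# This is far too few to say anything about the full dataset.
sample = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good",
            "Ideal", "Good", "Very Good", "Premium", "Ideal"],
    "price": [326, 326, 327, 334, 335, 2757, 2757, 2757, 2757, 2757],
})

def share_in_range(frame, low=500, high=1500):
    """Fraction of rows with low <= price <= high, overall and per cut."""
    in_range = frame["price"].between(low, high)
    return in_range.mean(), in_range.groupby(frame["cut"]).mean()

# Bounds of 300 and 1000 are used here only so the tiny sample gives a non-trivial result
overall, by_cut = share_in_range(sample, low=300, high=1000)
print(overall)
print(by_cut)
```

On the real data, share_in_range(df) with the default bounds would give the fraction of diamonds priced between 500 and 1,500, which can be compared with the impression from the histogram.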

Boxplots

Based on this seaborn gallery example

with logarithmic scale:

sns.boxplot(df, x="cut", y="price", log_scale=True)
sns.despine(offset=10, trim=True)
plt.show()

without log scale:

sns.boxplot(df, x="cut", y="price")
sns.despine(offset=10, trim=True)
plt.show()

This is another plot that looks very different with log scaling. Without log scaling, the many outlier points at the high end of price dominate the plot and make the boxes hard to read. On both plots, the median prices for the cuts are in a similar range (roughly 2500). Something I don’t understand is why the upper whisker (the horizontal line on top of each box) looks the same across cuts on the log-scale plot but differs on the linear-scale plot.
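A possible explanation for the whisker question, assuming seaborn applies the usual 1.5 × IQR whisker rule to log-transformed prices when log_scale=True (an assumption about its behavior, not something verified in this notebook): in log space the fence sits far above the data, so every whisker would extend to that cut's maximum price, and those maxima are all close to the top of the price range. On the linear scale the fence is Q3 + 1.5 × IQR in dollars, which differs by cut, and anything beyond it is drawn as an outlier. The sketch below reproduces the rule with the standard library on invented prices; upper_whisker and the price list are hypothetical.

```python
import math
import statistics

def upper_whisker(values, log_space=False):
    """Largest observation at or below Q3 + 1.5 * IQR, optionally computed on log10 values."""
    data = sorted(values)
    scaled = [math.log10(v) for v in data] if log_space else data
    q1, _, q3 = statistics.quantiles(scaled, n=4, method="inclusive")
    fence = q3 + 1.5 * (q3 - q1)
    # Report the whisker as an original data value, not a transformed one
    return max(v for v, s in zip(data, scaled) if s <= fence)

# Hypothetical right-skewed prices, invented for illustration (not taken from the dataset)
prices = [400, 600, 800, 1000, 1500, 2500, 4000, 6000, 9000, 14000, 18000]

linear = upper_whisker(prices)
logged = upper_whisker(prices, log_space=True)
print(linear, logged)
```

With these made-up prices the linear-scale whisker stops at 14000 and leaves 18000 as an outlier, while the log-space whisker reaches the maximum of 18000. Comparing df.groupby("cut")["price"].max() across cuts would show whether the same thing is happening in the real plot.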